Exploratory Data Analysis (EDA)¶

Obsah¶

  • Dataset so syntetickými dátami

    • Chýbajúce hodnoty, dátové typy, duplikáty a deskriptívna štatistika
    • Vizualizácia dát
    • Záver
    • Korelácia
  • Dataset s reálnymi dátami

    • Chýbajúce hodnoty, dátové typy, duplikáty a deskriptívna štatistika
    • Vizualizácia dát
    • Záver
    • Predspracovanie dát
  • Výber metrík

  • Záver

  • Referencie

In [126]:
import json
import math
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV, SequentialFeatureSelector
from sklearn.inspection import permutation_importance
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Ignore warnings
warnings.filterwarnings("ignore")

Dataset zo syntetickej premávky ¶

In [127]:
data = pd.read_csv("../datasets/simulated_network_data.csv")
data.head()
Out[127]:
timestamp amf_session_value bearers_active_value fivegs_amffunction_amf_authreject_value fivegs_amffunction_amf_authreq_value fivegs_amffunction_mm_confupdate_value fivegs_amffunction_mm_confupdatesucc_value fivegs_amffunction_mm_paging5greq_value fivegs_amffunction_mm_paging5gsucc_value fivegs_amffunction_rm_regemergreq_value ... process_resident_memory_bytes_value process_start_time_seconds_value process_virtual_memory_bytes_value process_virtual_memory_max_bytes_value ran_ue_value s5c_rx_createsession_value s5c_rx_parse_failed_value application log_type current_uc
0 2025-04-11 14:41:57 4.0 4.0 0.0 4.0 4.0 0.0 0.0 0.0 0.0 ... 52657356.8 3.644742e+08 1.151508e+09 -1.0 0.0 0.0 0.0 0 0 uc6
1 2025-04-11 14:41:58 4.0 4.0 0.0 4.0 4.0 0.0 0.0 0.0 0.0 ... 52657356.8 3.644742e+08 1.151508e+09 -1.0 0.0 0.0 0.0 0 0 uc6
2 2025-04-11 14:41:59 4.0 4.0 0.0 4.0 4.0 0.0 0.0 0.0 0.0 ... 52657356.8 3.644742e+08 1.151508e+09 -1.0 0.0 0.0 0.0 0 0 uc6
3 2025-04-11 14:42:00 4.0 4.0 0.0 4.0 4.0 0.0 0.0 0.0 0.0 ... 52657356.8 3.644742e+08 1.151508e+09 -1.0 0.0 0.0 0.0 0 0 uc6
4 2025-04-11 14:42:01 4.0 4.0 0.0 4.0 4.0 0.0 0.0 0.0 0.0 ... 52657356.8 3.644742e+08 1.151508e+09 -1.0 0.0 0.0 0.0 0 0 uc6

5 rows × 58 columns

Chýbajúce hodnoty, Dátové typy, Duplikáty a Deskriptívna štatistika ¶

In [128]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43791 entries, 0 to 43790
Data columns (total 58 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   timestamp                                           43791 non-null  object 
 1   amf_session_value                                   43791 non-null  float64
 2   bearers_active_value                                43791 non-null  float64
 3   fivegs_amffunction_amf_authreject_value             43791 non-null  float64
 4   fivegs_amffunction_amf_authreq_value                43791 non-null  float64
 5   fivegs_amffunction_mm_confupdate_value              43791 non-null  float64
 6   fivegs_amffunction_mm_confupdatesucc_value          43791 non-null  float64
 7   fivegs_amffunction_mm_paging5greq_value             43791 non-null  float64
 8   fivegs_amffunction_mm_paging5gsucc_value            43791 non-null  float64
 9   fivegs_amffunction_rm_regemergreq_value             43791 non-null  float64
 10  fivegs_amffunction_rm_regemergsucc_value            43791 non-null  float64
 11  fivegs_amffunction_rm_reginitreq_value              43791 non-null  float64
 12  fivegs_amffunction_rm_reginitsucc_value             43791 non-null  float64
 13  fivegs_amffunction_rm_registeredsubnbr_value        43791 non-null  float64
 14  fivegs_amffunction_rm_regmobreq_value               43791 non-null  float64
 15  fivegs_amffunction_rm_regmobsucc_value              43791 non-null  float64
 16  fivegs_amffunction_rm_regperiodreq_value            43791 non-null  float64
 17  fivegs_amffunction_rm_regperiodsucc_value           43791 non-null  float64
 18  fivegs_ep_n3_gtp_indatapktn3upf_value               43791 non-null  float64
 19  fivegs_ep_n3_gtp_outdatapktn3upf_value              43791 non-null  float64
 20  fivegs_pcffunction_pa_policyamassoreq_value         43791 non-null  float64
 21  fivegs_pcffunction_pa_policyamassosucc_value        43791 non-null  float64
 22  fivegs_pcffunction_pa_policysmassoreq_value         43791 non-null  float64
 23  fivegs_pcffunction_pa_policysmassosucc_value        43791 non-null  float64
 24  fivegs_pcffunction_pa_sessionnbr_value              43791 non-null  float64
 25  fivegs_smffunction_sm_n4sessionestabreq_value       43791 non-null  float64
 26  fivegs_smffunction_sm_n4sessionreport_value         43791 non-null  float64
 27  fivegs_smffunction_sm_n4sessionreportsucc_value     43791 non-null  float64
 28  fivegs_smffunction_sm_pdusessioncreationreq_value   43791 non-null  float64
 29  fivegs_smffunction_sm_pdusessioncreationsucc_value  43791 non-null  float64
 30  fivegs_smffunction_sm_qos_flow_nbr_value            43791 non-null  float64
 31  fivegs_smffunction_sm_sessionnbr_value              43791 non-null  float64
 32  fivegs_upffunction_sm_n4sessionestabreq_value       43791 non-null  float64
 33  fivegs_upffunction_sm_n4sessionreport_value         43791 non-null  float64
 34  fivegs_upffunction_sm_n4sessionreportsucc_value     43791 non-null  float64
 35  fivegs_upffunction_upf_qosflows_value               43791 non-null  float64
 36  fivegs_upffunction_upf_sessionnbr_value             43791 non-null  float64
 37  gn_rx_createpdpcontextreq_value                     43791 non-null  float64
 38  gn_rx_deletepdpcontextreq_value                     43791 non-null  float64
 39  gn_rx_parse_failed_value                            43791 non-null  float64
 40  gnb_value                                           43791 non-null  float64
 41  gtp1_pdpctxs_active_value                           43791 non-null  float64
 42  gtp2_sessions_active_value                          43791 non-null  float64
 43  gtp_new_node_failed_value                           43791 non-null  float64
 44  gtp_peers_active_value                              43791 non-null  float64
 45  process_cpu_seconds_total_value                     43791 non-null  float64
 46  process_max_fds_value                               43791 non-null  float64
 47  process_open_fds_value                              43791 non-null  float64
 48  process_resident_memory_bytes_value                 43791 non-null  float64
 49  process_start_time_seconds_value                    43791 non-null  float64
 50  process_virtual_memory_bytes_value                  43791 non-null  float64
 51  process_virtual_memory_max_bytes_value              43791 non-null  float64
 52  ran_ue_value                                        43791 non-null  float64
 53  s5c_rx_createsession_value                          43791 non-null  float64
 54  s5c_rx_parse_failed_value                           43791 non-null  float64
 55  application                                         43791 non-null  object 
 56  log_type                                            43791 non-null  object 
 57  current_uc                                          43791 non-null  object 
dtypes: float64(54), object(4)
memory usage: 19.4+ MB
In [129]:
data.describe(include='all')
Out[129]:
timestamp amf_session_value bearers_active_value fivegs_amffunction_amf_authreject_value fivegs_amffunction_amf_authreq_value fivegs_amffunction_mm_confupdate_value fivegs_amffunction_mm_confupdatesucc_value fivegs_amffunction_mm_paging5greq_value fivegs_amffunction_mm_paging5gsucc_value fivegs_amffunction_rm_regemergreq_value ... process_resident_memory_bytes_value process_start_time_seconds_value process_virtual_memory_bytes_value process_virtual_memory_max_bytes_value ran_ue_value s5c_rx_createsession_value s5c_rx_parse_failed_value application log_type current_uc
count 43791 43791.000000 43791.000000 43791.0 43791.000000 43791.000000 43791.0 43791.0 43791.0 43791.0 ... 4.379100e+04 4.379100e+04 4.379100e+04 43791.0 43791.000000 43791.0 43791.0 43791 43791 43791
unique 29749 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 6 5 6
top 2025-04-11 19:28:55 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 0 0 uc4
freq 140 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 29014 29014 11619
mean NaN 3.925715 3.919938 0.0 104.852207 104.833801 0.0 0.0 0.0 0.0 ... 5.147614e+07 3.677348e+08 1.150896e+09 -1.0 1.934598 0.0 0.0 NaN NaN NaN
std NaN 0.428739 0.435187 0.0 64.203809 64.197099 0.0 0.0 0.0 0.0 ... 1.860685e+06 1.810127e+06 1.110534e+05 0.0 1.737978 0.0 0.0 NaN NaN NaN
min NaN 0.000000 0.000000 0.0 2.000000 2.000000 0.0 0.0 0.0 0.0 ... 4.636180e+07 3.644742e+08 1.150873e+09 -1.0 0.000000 0.0 0.0 NaN NaN NaN
25% NaN 4.000000 4.000000 0.0 39.000000 38.000000 0.0 0.0 0.0 0.0 ... 5.207409e+07 3.660509e+08 1.150873e+09 -1.0 0.000000 0.0 0.0 NaN NaN NaN
50% NaN 4.000000 4.000000 0.0 113.000000 113.000000 0.0 0.0 0.0 0.0 ... 5.221827e+07 3.660509e+08 1.150873e+09 -1.0 1.000000 0.0 0.0 NaN NaN NaN
75% NaN 4.000000 4.000000 0.0 159.000000 159.000000 0.0 0.0 0.0 0.0 ... 5.231002e+07 3.695255e+08 1.150873e+09 -1.0 4.000000 0.0 0.0 NaN NaN NaN
max NaN 4.000000 4.000000 0.0 204.000000 204.000000 0.0 0.0 0.0 0.0 ... 5.268357e+07 3.695255e+08 1.151508e+09 -1.0 4.000000 0.0 0.0 NaN NaN NaN

11 rows × 58 columns

In [130]:
data.isnull().sum()[data.isnull().sum() > 0]
Out[130]:
Series([], dtype: int64)
In [131]:
data.isnull().sum() / len(data) * 100
Out[131]:
timestamp                                             0.0
amf_session_value                                     0.0
bearers_active_value                                  0.0
fivegs_amffunction_amf_authreject_value               0.0
fivegs_amffunction_amf_authreq_value                  0.0
fivegs_amffunction_mm_confupdate_value                0.0
fivegs_amffunction_mm_confupdatesucc_value            0.0
fivegs_amffunction_mm_paging5greq_value               0.0
fivegs_amffunction_mm_paging5gsucc_value              0.0
fivegs_amffunction_rm_regemergreq_value               0.0
fivegs_amffunction_rm_regemergsucc_value              0.0
fivegs_amffunction_rm_reginitreq_value                0.0
fivegs_amffunction_rm_reginitsucc_value               0.0
fivegs_amffunction_rm_registeredsubnbr_value          0.0
fivegs_amffunction_rm_regmobreq_value                 0.0
fivegs_amffunction_rm_regmobsucc_value                0.0
fivegs_amffunction_rm_regperiodreq_value              0.0
fivegs_amffunction_rm_regperiodsucc_value             0.0
fivegs_ep_n3_gtp_indatapktn3upf_value                 0.0
fivegs_ep_n3_gtp_outdatapktn3upf_value                0.0
fivegs_pcffunction_pa_policyamassoreq_value           0.0
fivegs_pcffunction_pa_policyamassosucc_value          0.0
fivegs_pcffunction_pa_policysmassoreq_value           0.0
fivegs_pcffunction_pa_policysmassosucc_value          0.0
fivegs_pcffunction_pa_sessionnbr_value                0.0
fivegs_smffunction_sm_n4sessionestabreq_value         0.0
fivegs_smffunction_sm_n4sessionreport_value           0.0
fivegs_smffunction_sm_n4sessionreportsucc_value       0.0
fivegs_smffunction_sm_pdusessioncreationreq_value     0.0
fivegs_smffunction_sm_pdusessioncreationsucc_value    0.0
fivegs_smffunction_sm_qos_flow_nbr_value              0.0
fivegs_smffunction_sm_sessionnbr_value                0.0
fivegs_upffunction_sm_n4sessionestabreq_value         0.0
fivegs_upffunction_sm_n4sessionreport_value           0.0
fivegs_upffunction_sm_n4sessionreportsucc_value       0.0
fivegs_upffunction_upf_qosflows_value                 0.0
fivegs_upffunction_upf_sessionnbr_value               0.0
gn_rx_createpdpcontextreq_value                       0.0
gn_rx_deletepdpcontextreq_value                       0.0
gn_rx_parse_failed_value                              0.0
gnb_value                                             0.0
gtp1_pdpctxs_active_value                             0.0
gtp2_sessions_active_value                            0.0
gtp_new_node_failed_value                             0.0
gtp_peers_active_value                                0.0
process_cpu_seconds_total_value                       0.0
process_max_fds_value                                 0.0
process_open_fds_value                                0.0
process_resident_memory_bytes_value                   0.0
process_start_time_seconds_value                      0.0
process_virtual_memory_bytes_value                    0.0
process_virtual_memory_max_bytes_value                0.0
ran_ue_value                                          0.0
s5c_rx_createsession_value                            0.0
s5c_rx_parse_failed_value                             0.0
application                                           0.0
log_type                                              0.0
current_uc                                            0.0
dtype: float64
Chýbajúce hodnoty: Chýbajúce hodnoty v dátach. Je potrebné ich spracovať pred použitím pri ML.
In [132]:
data.nunique()[data.nunique() > 1].apply(lambda x: f"{x:<50}{data.nunique()[data.nunique() > 1].index[data.nunique()[data.nunique() > 1] == x][0]}")
Out[132]:
timestamp                                             29749                                         ...
amf_session_value                                     5                                             ...
bearers_active_value                                  5                                             ...
fivegs_amffunction_amf_authreq_value                  203                                           ...
fivegs_amffunction_mm_confupdate_value                198                                           ...
fivegs_amffunction_rm_reginitreq_value                227                                           ...
fivegs_amffunction_rm_reginitsucc_value               198                                           ...
fivegs_amffunction_rm_registeredsubnbr_value          5                                             ...
fivegs_pcffunction_pa_policyamassoreq_value           202                                           ...
fivegs_pcffunction_pa_policyamassosucc_value          202                                           ...
fivegs_pcffunction_pa_policysmassoreq_value           202                                           ...
fivegs_pcffunction_pa_policysmassosucc_value          202                                           ...
fivegs_pcffunction_pa_sessionnbr_value                5                                             ...
fivegs_smffunction_sm_pdusessioncreationreq_value     203                                           ...
fivegs_smffunction_sm_pdusessioncreationsucc_value    203                                           ...
fivegs_smffunction_sm_qos_flow_nbr_value              203                                           ...
fivegs_smffunction_sm_sessionnbr_value                5                                             ...
fivegs_upffunction_sm_n4sessionestabreq_value         200                                           ...
fivegs_upffunction_upf_qosflows_value                 5                                             ...
fivegs_upffunction_upf_sessionnbr_value               5                                             ...
process_cpu_seconds_total_value                       4372                                          ...
process_open_fds_value                                7                                             ...
process_resident_memory_bytes_value                   206                                           ...
process_start_time_seconds_value                      3                                             ...
process_virtual_memory_bytes_value                    4                                             ...
ran_ue_value                                          5                                             ...
application                                           6                                             ...
log_type                                              5                                             ...
current_uc                                            6                                             ...
dtype: object
In [133]:
data.duplicated().sum()
Out[133]:
np.int64(9728)
Duplikáty: Duplikáty v dátach znamenajú, že stav sa medzi jednotlivými meraniami nezmenil.

Záver ¶

Čo treba zodpovedať:

  1. Chýbajúce hodnoty:

    • Koľko chýbajúcich hodnôt je v každom stĺpci?
      • žiadne chýbajúce hodnoty
  2. Dátové typy:

    • Aké sú dátové typy každého stĺpca?
      • timestamp object
      • application object
      • log_type object
      • current_uc object
      • Ostatné stĺpce sú float64
    • Ako konvertovať dátové typy?
      • Map the columns to the correct data types using the astype() method
      • Použite astype() na konverziu stĺpcov na správne dátové typy.
  3. Duplikáty:

    • Aké sú duplicitné stĺpce v datasete?
      • Viacero riadkov má iba jednu hodnotu, môžeme tieto stĺpce odstrániť.
    • Koľko duplicitných riadkov je v datasete?
      • 9728
    • Ako odstrániť duplikáty?
      • Používame časové údaje, takže duplikáty znamenajú, že sa stav nezmenil.

Príprava dát¶

In [134]:
# Remove the columns with only one unique value
data = data.loc[:, data.nunique() > 1]
In [135]:
# Missing values imputation
data.fillna(data.mode().iloc[0], inplace=True)
# Check for missing values again
data.isnull().sum()[data.isnull().sum() > 0]
Out[135]:
Series([], dtype: int64)
Dáta: Odstránené stĺpce, ktoré nie sú potrebné pre analýzu. Chýbajúce hodnoty boli nahradené módom stĺpca.
In [136]:
# Convert timestamp to datetime
data['timestamp'] = pd.to_datetime(data['timestamp'])

# Map string values to numerical values
with open('./json/log_map.json', 'r') as f:
    LOG_MAP = json.load(f)

with open('./json/app_map.json', 'r') as f:
    APP_MAP = json.load(f)

with open('./json/uc_map.json', 'r') as f:
    UC_MAP = json.load(f)

data['application'] = data['application'].map(APP_MAP)
data['log_type'] = data['log_type'].map(LOG_MAP)
data['current_uc'] = data['current_uc'].map(UC_MAP)

# Check if the values were correctly mapped
data.dtypes
Out[136]:
timestamp                                             datetime64[ns]
amf_session_value                                            float64
bearers_active_value                                         float64
fivegs_amffunction_amf_authreq_value                         float64
fivegs_amffunction_mm_confupdate_value                       float64
fivegs_amffunction_rm_reginitreq_value                       float64
fivegs_amffunction_rm_reginitsucc_value                      float64
fivegs_amffunction_rm_registeredsubnbr_value                 float64
fivegs_pcffunction_pa_policyamassoreq_value                  float64
fivegs_pcffunction_pa_policyamassosucc_value                 float64
fivegs_pcffunction_pa_policysmassoreq_value                  float64
fivegs_pcffunction_pa_policysmassosucc_value                 float64
fivegs_pcffunction_pa_sessionnbr_value                       float64
fivegs_smffunction_sm_pdusessioncreationreq_value            float64
fivegs_smffunction_sm_pdusessioncreationsucc_value           float64
fivegs_smffunction_sm_qos_flow_nbr_value                     float64
fivegs_smffunction_sm_sessionnbr_value                       float64
fivegs_upffunction_sm_n4sessionestabreq_value                float64
fivegs_upffunction_upf_qosflows_value                        float64
fivegs_upffunction_upf_sessionnbr_value                      float64
process_cpu_seconds_total_value                              float64
process_open_fds_value                                       float64
process_resident_memory_bytes_value                          float64
process_start_time_seconds_value                             float64
process_virtual_memory_bytes_value                           float64
ran_ue_value                                                 float64
application                                                    int64
log_type                                                       int64
current_uc                                                     int64
dtype: object
Dátové typy: Všetky stĺpce boli konvertované na správne dátové typy ('float64', 'int64', 'datetime64[ns]').

Analýza jednotlivých stĺpcov ¶

In [137]:
plt.figure(figsize=(8,5))
sns.countplot(x='current_uc', data=data)
plt.title("Distribúcia UC tried (current_uc)")
plt.xlabel("UC trieda")
plt.ylabel("Počet vzoriek")
plt.show()
No description has been provided for this image
Rozdelenie UC: Treba brať do úvahy, že niektoré používateľské prípady (UC) sa v datasete vyskytujú častejšie ako iné.
In [138]:
# Class weights for imbalanced classes
classes = np.unique(data['current_uc'].dropna())
weights = compute_class_weight(class_weight='balanced', classes=classes, y=data['current_uc'].dropna())
class_weights = dict(zip(classes, weights))
print("Class weights:", class_weights)
Class weights: {np.int64(0): np.float64(0.7493326488706366), np.int64(1): np.float64(1.159228081321474), np.int64(2): np.float64(0.8667022918893243), np.int64(3): np.float64(0.6281521645580515), np.int64(4): np.float64(1.4750404203718674), np.int64(5): np.float64(2.637694253704373)}
In [139]:
# Save class weights to JSON
class_weights_serializable = {int(k): float(v) for k, v in class_weights.items()}
with open('./json/class_weights.json', 'w') as f:
    json.dump(class_weights_serializable, f)
Rozdelenie UC: Tento problém je možné vyriešiť pomocou váženého priemeru.
In [140]:
# Select numerical columns for feature selection
numerical_cols = data.select_dtypes(include=['float64', 'int64']).columns.tolist()

n_cols = 3
n_rows = math.ceil(len(numerical_cols) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))


axes = axes.flatten()

for i, col in enumerate(numerical_cols):
  # Histogram
  sns.histplot(data[col], kde=True, bins=30, ax=axes[i])
  axes[i].set_title(f'Distribúcia: {col}')
  axes[i].set_xlabel(col)
  axes[i].set_ylabel('Frekvencia')

# Remove empty axes
for j in range(i + 1, len(axes)):
  fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
No description has been provided for this image
In [141]:
# Correlation matrix
corr = data.corr()


plt.figure(figsize=(25, 25))
plt.title("Correlation Matrix")

sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar_kws={"shrink": .8})
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
No description has been provided for this image
Korelácia: Žiadna metrika nemá silnú koreláciu s UC triedou, ktorú budeme klasifikovať.

Dataset z reálnej premávky ¶

In [142]:
real_data = pd.read_csv("../datasets/real_network_data_after_labeling.csv")
real_data.head()
Out[142]:
timestamp amf_session_value bearers_active_value fivegs_amffunction_amf_authreject_value fivegs_amffunction_amf_authreq_value fivegs_amffunction_mm_confupdate_value fivegs_amffunction_mm_confupdatesucc_value fivegs_amffunction_mm_paging5greq_value fivegs_amffunction_mm_paging5gsucc_value fivegs_amffunction_rm_regemergreq_value ... process_resident_memory_bytes_value process_start_time_seconds_value process_virtual_memory_bytes_value process_virtual_memory_max_bytes_value ran_ue_value s5c_rx_createsession_value s5c_rx_parse_failed_value application log_type current_uc
0 2025-04-10 12:28:14 2.0 2.0 0.0 10.0 599.0 499.0 1034.0 498.0 0.0 ... 50106368.0 118464611.5 1.404078e+09 -1.0 0.0 0.0 0.0 0 0 uc1
1 2025-04-10 12:28:15 2.0 2.0 0.0 10.0 599.0 499.0 1034.0 498.0 0.0 ... 50106368.0 118464611.5 1.404078e+09 -1.0 0.0 0.0 0.0 0 0 uc1
2 2025-04-10 12:28:16 2.0 2.0 0.0 10.0 599.0 499.0 1034.0 498.0 0.0 ... 50106368.0 118464611.5 1.404078e+09 -1.0 0.0 0.0 0.0 0 0 uc1
3 2025-04-10 12:28:17 2.0 2.0 0.0 10.0 599.0 499.0 1034.0 498.0 0.0 ... 50106368.0 118464611.5 1.404078e+09 -1.0 0.0 0.0 0.0 0 0 uc1
4 2025-04-10 12:28:18 2.0 2.0 0.0 10.0 599.0 499.0 1034.0 498.0 0.0 ... 50106368.0 118464611.5 1.404078e+09 -1.0 0.0 0.0 0.0 0 0 uc1

5 rows × 58 columns

Chýbajúce hodnoty, Dátové typy, Duplikáty a Deskriptívna štatistika ¶

In [143]:
real_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6785 entries, 0 to 6784
Data columns (total 58 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   timestamp                                           6785 non-null   object 
 1   amf_session_value                                   6785 non-null   float64
 2   bearers_active_value                                6785 non-null   float64
 3   fivegs_amffunction_amf_authreject_value             6785 non-null   float64
 4   fivegs_amffunction_amf_authreq_value                6785 non-null   float64
 5   fivegs_amffunction_mm_confupdate_value              6785 non-null   float64
 6   fivegs_amffunction_mm_confupdatesucc_value          6785 non-null   float64
 7   fivegs_amffunction_mm_paging5greq_value             6785 non-null   float64
 8   fivegs_amffunction_mm_paging5gsucc_value            6785 non-null   float64
 9   fivegs_amffunction_rm_regemergreq_value             6785 non-null   float64
 10  fivegs_amffunction_rm_regemergsucc_value            6785 non-null   float64
 11  fivegs_amffunction_rm_reginitreq_value              6785 non-null   float64
 12  fivegs_amffunction_rm_reginitsucc_value             6785 non-null   float64
 13  fivegs_amffunction_rm_registeredsubnbr_value        6785 non-null   float64
 14  fivegs_amffunction_rm_regmobreq_value               6785 non-null   float64
 15  fivegs_amffunction_rm_regmobsucc_value              6785 non-null   float64
 16  fivegs_amffunction_rm_regperiodreq_value            6785 non-null   float64
 17  fivegs_amffunction_rm_regperiodsucc_value           6785 non-null   float64
 18  fivegs_ep_n3_gtp_indatapktn3upf_value               6785 non-null   float64
 19  fivegs_ep_n3_gtp_outdatapktn3upf_value              6785 non-null   float64
 20  fivegs_pcffunction_pa_policyamassoreq_value         6785 non-null   float64
 21  fivegs_pcffunction_pa_policyamassosucc_value        6785 non-null   float64
 22  fivegs_pcffunction_pa_policysmassoreq_value         6785 non-null   float64
 23  fivegs_pcffunction_pa_policysmassosucc_value        6785 non-null   float64
 24  fivegs_pcffunction_pa_sessionnbr_value              6785 non-null   float64
 25  fivegs_smffunction_sm_n4sessionestabreq_value       6785 non-null   float64
 26  fivegs_smffunction_sm_n4sessionreport_value         6785 non-null   float64
 27  fivegs_smffunction_sm_n4sessionreportsucc_value     6785 non-null   float64
 28  fivegs_smffunction_sm_pdusessioncreationreq_value   6785 non-null   float64
 29  fivegs_smffunction_sm_pdusessioncreationsucc_value  6785 non-null   float64
 30  fivegs_smffunction_sm_qos_flow_nbr_value            6785 non-null   float64
 31  fivegs_smffunction_sm_sessionnbr_value              6785 non-null   float64
 32  fivegs_upffunction_sm_n4sessionestabreq_value       6785 non-null   float64
 33  fivegs_upffunction_sm_n4sessionreport_value         6785 non-null   float64
 34  fivegs_upffunction_sm_n4sessionreportsucc_value     6785 non-null   float64
 35  fivegs_upffunction_upf_qosflows_value               6785 non-null   float64
 36  fivegs_upffunction_upf_sessionnbr_value             6785 non-null   float64
 37  gn_rx_createpdpcontextreq_value                     6785 non-null   float64
 38  gn_rx_deletepdpcontextreq_value                     6785 non-null   float64
 39  gn_rx_parse_failed_value                            6785 non-null   float64
 40  gnb_value                                           6785 non-null   float64
 41  gtp1_pdpctxs_active_value                           6785 non-null   float64
 42  gtp2_sessions_active_value                          6785 non-null   float64
 43  gtp_new_node_failed_value                           6785 non-null   float64
 44  gtp_peers_active_value                              6785 non-null   float64
 45  process_cpu_seconds_total_value                     6785 non-null   float64
 46  process_max_fds_value                               6785 non-null   float64
 47  process_open_fds_value                              6785 non-null   float64
 48  process_resident_memory_bytes_value                 6785 non-null   float64
 49  process_start_time_seconds_value                    6785 non-null   float64
 50  process_virtual_memory_bytes_value                  6785 non-null   float64
 51  process_virtual_memory_max_bytes_value              6785 non-null   float64
 52  ran_ue_value                                        6785 non-null   float64
 53  s5c_rx_createsession_value                          6785 non-null   float64
 54  s5c_rx_parse_failed_value                           6785 non-null   float64
 55  application                                         6785 non-null   object 
 56  log_type                                            6785 non-null   object 
 57  current_uc                                          6785 non-null   object 
dtypes: float64(54), object(4)
memory usage: 3.0+ MB
In [144]:
real_data.describe(include='all')
Out[144]:
timestamp amf_session_value bearers_active_value fivegs_amffunction_amf_authreject_value fivegs_amffunction_amf_authreq_value fivegs_amffunction_mm_confupdate_value fivegs_amffunction_mm_confupdatesucc_value fivegs_amffunction_mm_paging5greq_value fivegs_amffunction_mm_paging5gsucc_value fivegs_amffunction_rm_regemergreq_value ... process_resident_memory_bytes_value process_start_time_seconds_value process_virtual_memory_bytes_value process_virtual_memory_max_bytes_value ran_ue_value s5c_rx_createsession_value s5c_rx_parse_failed_value application log_type current_uc
count 6785 6785.000000 6785.000000 6785.0 6785.000000 6785.000000 6785.000000 6785.000000 6785.000000 6785.0 ... 6.785000e+03 6.785000e+03 6.785000e+03 6785.0 6785.000000 6785.0 6785.0 6785 6785 6785
unique 5084 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 4 6 6
top 2025-04-10 13:20:34 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 0 0 uc1
freq 9 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 4572 4572 3145
mean NaN 2.791452 2.766839 0.0 12.206338 734.754901 531.113780 1096.827119 529.088578 0.0 ... 5.341072e+07 1.185351e+08 1.405344e+09 -1.0 1.922918 0.0 0.0 NaN NaN NaN
std NaN 1.494975 1.526840 0.0 2.765957 282.378079 125.164509 256.344911 124.266089 0.0 ... 2.649311e+06 1.159745e+06 2.082018e+07 0.0 1.353971 0.0 0.0 NaN NaN NaN
min NaN 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 ... 5.010637e+07 1.184646e+08 1.404078e+09 -1.0 0.000000 0.0 0.0 NaN NaN NaN
25% NaN 1.000000 1.000000 0.0 12.000000 618.000000 506.000000 1048.000000 505.000000 0.0 ... 5.169664e+07 1.184646e+08 1.404078e+09 -1.0 1.000000 0.0 0.0 NaN NaN NaN
50% NaN 4.000000 4.000000 0.0 12.000000 633.000000 521.000000 1078.000000 519.000000 0.0 ... 5.211750e+07 1.184646e+08 1.404078e+09 -1.0 2.000000 0.0 0.0 NaN NaN NaN
75% NaN 4.000000 4.000000 0.0 12.000000 640.000000 527.000000 1090.000000 525.000000 0.0 ... 5.765837e+07 1.184646e+08 1.404078e+09 -1.0 3.000000 0.0 0.0 NaN NaN NaN
max NaN 4.000000 4.000000 0.0 15.000000 1211.000000 671.000000 1378.000000 667.000000 0.0 ... 6.247902e+07 1.376044e+08 1.747683e+09 -1.0 5.000000 0.0 0.0 NaN NaN NaN

11 rows × 58 columns

In [145]:
real_data.isnull().sum()[real_data.isnull().sum() > 0]
Out[145]:
Series([], dtype: int64)
In [146]:
real_data.nunique()[real_data.nunique() > 1].apply(lambda x: f"{x:<50}{real_data.nunique()[real_data.nunique() > 1].index[real_data.nunique()[real_data.nunique() > 1] == x][0]}")
Out[146]:
timestamp                                             5084                                          ...
amf_session_value                                     5                                             ...
bearers_active_value                                  5                                             ...
fivegs_amffunction_amf_authreq_value                  4                                             ...
fivegs_amffunction_mm_confupdate_value                31                                            ...
fivegs_amffunction_mm_confupdatesucc_value            27                                            ...
fivegs_amffunction_mm_paging5greq_value               27                                            ...
fivegs_amffunction_mm_paging5gsucc_value              26                                            ...
fivegs_amffunction_rm_reginitreq_value                9                                             ...
fivegs_amffunction_rm_reginitsucc_value               5                                             ...
fivegs_amffunction_rm_registeredsubnbr_value          5                                             ...
fivegs_amffunction_rm_regmobreq_value                 4                                             ...
fivegs_amffunction_rm_regmobsucc_value                4                                             ...
fivegs_amffunction_rm_regperiodreq_value              7                                             ...
fivegs_amffunction_rm_regperiodsucc_value             7                                             ...
fivegs_pcffunction_pa_policyamassoreq_value           5                                             ...
fivegs_pcffunction_pa_policyamassosucc_value          5                                             ...
fivegs_pcffunction_pa_policysmassoreq_value           79                                            ...
fivegs_pcffunction_pa_policysmassosucc_value          79                                            ...
fivegs_pcffunction_pa_sessionnbr_value                7                                             ...
fivegs_smffunction_sm_n4sessionestabreq_value         2                                             ...
fivegs_smffunction_sm_n4sessionreport_value           54                                            ...
fivegs_smffunction_sm_n4sessionreportsucc_value       54                                            ...
fivegs_smffunction_sm_pdusessioncreationreq_value     75                                            ...
fivegs_smffunction_sm_pdusessioncreationsucc_value    75                                            ...
fivegs_smffunction_sm_qos_flow_nbr_value              75                                            ...
fivegs_smffunction_sm_sessionnbr_value                5                                             ...
fivegs_upffunction_sm_n4sessionestabreq_value         79                                            ...
fivegs_upffunction_sm_n4sessionreport_value           55                                            ...
fivegs_upffunction_sm_n4sessionreportsucc_value       54                                            ...
fivegs_upffunction_upf_qosflows_value                 9                                             ...
fivegs_upffunction_upf_sessionnbr_value               7                                             ...
gn_rx_createpdpcontextreq_value                       3                                             ...
gn_rx_deletepdpcontextreq_value                       3                                             ...
gn_rx_parse_failed_value                              3                                             ...
gnb_value                                             3                                             ...
gtp1_pdpctxs_active_value                             2                                             ...
process_cpu_seconds_total_value                       105                                           ...
process_open_fds_value                                5                                             ...
process_resident_memory_bytes_value                   370                                           ...
process_start_time_seconds_value                      2                                             ...
process_virtual_memory_bytes_value                    2                                             ...
ran_ue_value                                          6                                             ...
application                                           4                                             ...
log_type                                              6                                             ...
current_uc                                            6                                             ...
dtype: object
In [147]:
duplicates = real_data.duplicated()
duplicates_sum = duplicates.sum()
print(f"Total duplicates: {duplicates_sum}")
Total duplicates: 0
Dáta: V reálnych dátach sa nevyskytujú žiadne chýbajúce hodnoty ani duplikáty.

Záver ¶

Čo treba zodpovedať:

  1. Chýbajúce hodnoty:

    • Koľko chýbajúcich hodnôt je v každom stĺpci?
      • Žiadne chýbajúce hodnoty
  2. Dátové typy:

    • Aké sú dátové typy každého stĺpca?
      • timestamp object
      • application object
      • log_type object
      • current_uc object
      • Ostatné stĺpce sú float64
    • Ako konvertovať dátové typy?
      • Map the columns to the correct data types using the astype() method
      • Použite astype() na konverziu stĺpcov na správne dátové typy.
  3. Duplikáty:

    • Aké sú duplicitné stĺpce v datasete?
      • V datasete nie sú žiadne duplicitné stĺpce.
    • Koľko duplicitných riadkov je v datasete?
      • 0

Príprava dát ¶

In [148]:
# Remove columns with only one unique value
real_data = real_data.loc[:, real_data.nunique() > 1]
In [149]:
# Missing values imputation
real_data.fillna(real_data.mode().iloc[0], inplace=True)

# Check for missing values again
real_data.isnull().sum()[real_data.isnull().sum() > 0]
Out[149]:
Series([], dtype: int64)
In [150]:
# Convert timestamp to datetime
real_data['timestamp'] = pd.to_datetime(real_data['timestamp'])

real_data['application'] = real_data['application'].map(APP_MAP)
real_data['log_type'] = real_data['log_type'].map(LOG_MAP)
real_data['current_uc'] = real_data['current_uc'].map(UC_MAP)

# Check if we mapped the values correctly
real_data.dtypes
Out[150]:
timestamp                                             datetime64[ns]
amf_session_value                                            float64
bearers_active_value                                         float64
fivegs_amffunction_amf_authreq_value                         float64
fivegs_amffunction_mm_confupdate_value                       float64
fivegs_amffunction_mm_confupdatesucc_value                   float64
fivegs_amffunction_mm_paging5greq_value                      float64
fivegs_amffunction_mm_paging5gsucc_value                     float64
fivegs_amffunction_rm_reginitreq_value                       float64
fivegs_amffunction_rm_reginitsucc_value                      float64
fivegs_amffunction_rm_registeredsubnbr_value                 float64
fivegs_amffunction_rm_regmobreq_value                        float64
fivegs_amffunction_rm_regmobsucc_value                       float64
fivegs_amffunction_rm_regperiodreq_value                     float64
fivegs_amffunction_rm_regperiodsucc_value                    float64
fivegs_pcffunction_pa_policyamassoreq_value                  float64
fivegs_pcffunction_pa_policyamassosucc_value                 float64
fivegs_pcffunction_pa_policysmassoreq_value                  float64
fivegs_pcffunction_pa_policysmassosucc_value                 float64
fivegs_pcffunction_pa_sessionnbr_value                       float64
fivegs_smffunction_sm_n4sessionestabreq_value                float64
fivegs_smffunction_sm_n4sessionreport_value                  float64
fivegs_smffunction_sm_n4sessionreportsucc_value              float64
fivegs_smffunction_sm_pdusessioncreationreq_value            float64
fivegs_smffunction_sm_pdusessioncreationsucc_value           float64
fivegs_smffunction_sm_qos_flow_nbr_value                     float64
fivegs_smffunction_sm_sessionnbr_value                       float64
fivegs_upffunction_sm_n4sessionestabreq_value                float64
fivegs_upffunction_sm_n4sessionreport_value                  float64
fivegs_upffunction_sm_n4sessionreportsucc_value              float64
fivegs_upffunction_upf_qosflows_value                        float64
fivegs_upffunction_upf_sessionnbr_value                      float64
gn_rx_createpdpcontextreq_value                              float64
gn_rx_deletepdpcontextreq_value                              float64
gn_rx_parse_failed_value                                     float64
gnb_value                                                    float64
gtp1_pdpctxs_active_value                                    float64
process_cpu_seconds_total_value                              float64
process_open_fds_value                                       float64
process_resident_memory_bytes_value                          float64
process_start_time_seconds_value                             float64
process_virtual_memory_bytes_value                           float64
ran_ue_value                                                 float64
application                                                    int64
log_type                                                       int64
current_uc                                                     int64
dtype: object
Dátové typy: Všetky stĺpce boli konvertované na správne dátové typy ('float64', 'int64', 'datetime64[ns]').

Vizualizácia dát ¶

In [151]:
# Select only numerical columns for feature selection
numerical_cols = real_data.select_dtypes(include=['float64', 'int64']).columns.tolist()

n_cols = 3
n_rows = math.ceil(len(numerical_cols) / n_cols)
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
axes = axes.flatten()

for i, col in enumerate(numerical_cols):
  # Histogram
  sns.histplot(real_data[col], kde=True, bins=30, ax=axes[i])
  axes[i].set_title(f'Distribution: {col}')
  axes[i].set_xlabel(col)
  axes[i].set_ylabel('Frequency')

# Remove empty axes
for j in range(i + 1, len(axes)):
  fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
No description has been provided for this image
In [152]:
# Correlation matrix
corr = real_data.corr()

plt.figure(figsize=(25, 25))
plt.title("Correlation Matrix")

sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar_kws={"shrink": .8})
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
No description has been provided for this image
Korelácia: Narozdiel od syntetických dát, v reálnych dátach je silná korelácia medzi niektorými metrikami a UC triedou.

Výber metrík ¶

In [153]:
data = pd.read_csv("../datasets/simulated_network_data.csv")
real_data = pd.read_csv("../datasets/real_network_data_after_labeling.csv")
In [154]:
def preprocess_data(data):

    """
    Preprocess the input DataFrame: handle missing values, map categorical variables,
    select numerical features, and normalize the dataset.
    """
    
    data.fillna(data.mode().iloc[0], inplace=True)

    data['application'] = data['application'].map(APP_MAP)
    data['log_type'] = data['log_type'].map(LOG_MAP)
    data['current_uc'] = data['current_uc'].map(UC_MAP)

    # Numerical columns
    X = data.drop(columns=['timestamp', 'current_uc'], errors='ignore')
    X = X.select_dtypes(include=[np.number])
    y = data['current_uc'].astype(int)

    # Data scaling
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    return X_scaled, X, y
In [155]:
def base_estimator(X_scaled, X, y):

    """Fit a Random Forest model and calculate feature importances."""

    rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    
    # Fit the model
    rf.fit(X_scaled, y)
    rf_importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

    return rf_importances, rf
In [156]:
def RFE_def(X_scaled, y, X, rf):
    
    """Apply Recursive Feature Elimination (RFE) for feature selection."""
    
    rfe = RFE(estimator=rf, n_features_to_select=10)
    rfe.fit(X_scaled, y)
    rfe_selected = pd.Series(rfe.support_, index=X.columns)

    return rfe_selected
In [157]:
def RFECV_def(X_scaled, y, X, rf):

  """Apply Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection."""
  
  rfecv = RFECV(estimator=rf, step=1, cv=StratifiedKFold(5), scoring='f1_weighted', n_jobs=-1)
  rfecv.fit(X_scaled, y)
  rfecv_selected = pd.Series(rfecv.support_, index=X.columns)

  return rfecv_selected
In [158]:
def SFSX_def(X_scaled, y, X, rf):
  
  """Sequential Feature Selector for feature selection."""

  sfs = SequentialFeatureSelector(rf, n_features_to_select=10, direction='forward', scoring='f1_weighted', cv=5, n_jobs=-1)
  sfs.fit(X_scaled, y)
  sfs_selected = pd.Series(sfs.get_support(), index=X.columns)

  return sfs_selected
In [159]:
X_scaled, X, y = preprocess_data(data)


rf_importances, rf = base_estimator(X_scaled, X, y)
rfe_selected = RFE_def(X_scaled, y, X, rf)
rfecv_selected = RFECV_def(X_scaled, y, X, rf)
sfs_selected = SFSX_def(X_scaled, y, X, rf)

summary_df = pd.DataFrame({
    'Feature': X.columns,
    'RF_Importance': rf_importances,
    'Selected_RFE': rfe_selected,
    'Selected_RFECV': rfecv_selected,
    'Selected_SFS': sfs_selected
}).sort_values(by='RF_Importance', ascending=False)

summary_df
Out[159]:
Feature RF_Importance Selected_RFE Selected_RFECV Selected_SFS
process_cpu_seconds_total_value process_open_fds_value 0.140806 True True False
ran_ue_value s5c_rx_parse_failed_value 0.127135 True True True
fivegs_amffunction_rm_reginitreq_value fivegs_amffunction_rm_reginitsucc_value 0.091784 True True False
process_resident_memory_bytes_value process_virtual_memory_bytes_value 0.061610 True True False
fivegs_upffunction_sm_n4sessionestabreq_value fivegs_upffunction_sm_n4sessionreport_value 0.048814 True True False
fivegs_amffunction_rm_reginitsucc_value fivegs_amffunction_rm_registeredsubnbr_value 0.048154 True True False
fivegs_amffunction_amf_authreq_value fivegs_amffunction_mm_confupdate_value 0.046038 True True False
fivegs_pcffunction_pa_policyamassoreq_value fivegs_pcffunction_pa_policyamassosucc_value 0.044084 True True False
fivegs_amffunction_mm_confupdate_value fivegs_amffunction_mm_confupdatesucc_value 0.042526 True True False
process_start_time_seconds_value process_virtual_memory_max_bytes_value 0.041686 False True False
fivegs_pcffunction_pa_policysmassoreq_value fivegs_pcffunction_pa_policysmassosucc_value 0.038464 False True False
fivegs_smffunction_sm_pdusessioncreationreq_value fivegs_smffunction_sm_pdusessioncreationsucc_v... 0.038186 False True False
fivegs_pcffunction_pa_policyamassosucc_value fivegs_pcffunction_pa_policysmassoreq_value 0.035690 True True False
fivegs_smffunction_sm_pdusessioncreationsucc_value fivegs_smffunction_sm_qos_flow_nbr_value 0.035656 False True False
fivegs_pcffunction_pa_policysmassosucc_value fivegs_pcffunction_pa_sessionnbr_value 0.035275 False True False
fivegs_smffunction_sm_qos_flow_nbr_value fivegs_smffunction_sm_sessionnbr_value 0.035166 False True True
application bearers_active_value 0.023513 False True True
log_type process_max_fds_value 0.021229 False True False
process_open_fds_value process_start_time_seconds_value 0.011045 False True True
process_virtual_memory_bytes_value ran_ue_value 0.009324 False True True
bearers_active_value fivegs_amffunction_amf_authreject_value 0.003989 False True True
fivegs_pcffunction_pa_sessionnbr_value fivegs_smffunction_sm_n4sessionestabreq_value 0.003895 False True True
fivegs_smffunction_sm_sessionnbr_value fivegs_upffunction_sm_n4sessionestabreq_value 0.003616 False True False
fivegs_upffunction_upf_sessionnbr_value gn_rx_createpdpcontextreq_value 0.003378 False True True
fivegs_amffunction_rm_registeredsubnbr_value fivegs_amffunction_rm_regmobreq_value 0.003079 False True False
fivegs_upffunction_upf_qosflows_value fivegs_upffunction_upf_sessionnbr_value 0.002948 False True True
amf_session_value amf_session_value 0.002910 False True False
fivegs_smffunction_sm_n4sessionreportsucc_value fivegs_smffunction_sm_pdusessioncreationreq_value 0.000000 False False False
gtp2_sessions_active_value gtp_new_node_failed_value 0.000000 False False False
s5c_rx_createsession_value application 0.000000 False True False
fivegs_amffunction_amf_authreject_value fivegs_amffunction_amf_authreq_value 0.000000 False True True
process_virtual_memory_max_bytes_value s5c_rx_createsession_value 0.000000 False True False
fivegs_amffunction_mm_confupdatesucc_value fivegs_amffunction_mm_paging5greq_value 0.000000 False True False
fivegs_amffunction_mm_paging5greq_value fivegs_amffunction_mm_paging5gsucc_value 0.000000 False True False
fivegs_amffunction_mm_paging5gsucc_value fivegs_amffunction_rm_regemergreq_value 0.000000 False True False
fivegs_amffunction_rm_regemergreq_value fivegs_amffunction_rm_regemergsucc_value 0.000000 False False False
process_max_fds_value process_resident_memory_bytes_value 0.000000 False True False
fivegs_amffunction_rm_regemergsucc_value fivegs_amffunction_rm_reginitreq_value 0.000000 False False False
fivegs_amffunction_rm_regmobreq_value fivegs_amffunction_rm_regmobsucc_value 0.000000 False False False
gtp_peers_active_value process_cpu_seconds_total_value 0.000000 False True False
gtp_new_node_failed_value gtp_peers_active_value 0.000000 False False False
gtp1_pdpctxs_active_value gtp2_sessions_active_value 0.000000 False False False
fivegs_smffunction_sm_n4sessionreport_value fivegs_smffunction_sm_n4sessionreportsucc_value 0.000000 False False False
gnb_value gtp1_pdpctxs_active_value 0.000000 False False False
gn_rx_parse_failed_value gnb_value 0.000000 False False False
gn_rx_deletepdpcontextreq_value gn_rx_parse_failed_value 0.000000 False False False
gn_rx_createpdpcontextreq_value gn_rx_deletepdpcontextreq_value 0.000000 False False False
fivegs_amffunction_rm_regmobsucc_value fivegs_amffunction_rm_regperiodreq_value 0.000000 False False False
fivegs_amffunction_rm_regperiodreq_value fivegs_amffunction_rm_regperiodsucc_value 0.000000 False False False
fivegs_upffunction_sm_n4sessionreportsucc_value fivegs_upffunction_upf_qosflows_value 0.000000 False False False
fivegs_upffunction_sm_n4sessionreport_value fivegs_upffunction_sm_n4sessionreportsucc_value 0.000000 False False False
fivegs_amffunction_rm_regperiodsucc_value fivegs_ep_n3_gtp_indatapktn3upf_value 0.000000 False False False
fivegs_ep_n3_gtp_indatapktn3upf_value fivegs_ep_n3_gtp_outdatapktn3upf_value 0.000000 False False False
fivegs_ep_n3_gtp_outdatapktn3upf_value fivegs_pcffunction_pa_policyamassoreq_value 0.000000 False False False
fivegs_smffunction_sm_n4sessionestabreq_value fivegs_smffunction_sm_n4sessionreport_value 0.000000 False False False
s5c_rx_parse_failed_value log_type 0.000000 False True False
In [160]:
X_scaled_real, X_real, y_real = preprocess_data(real_data)

rf_importances_real, rf_real = base_estimator(X_scaled_real, X_real, y_real)
rfe_selected_real = RFE_def(X_scaled_real, y_real, X_real, rf_real)
rfecv_selected_real = RFECV_def(X_scaled_real, y_real, X_real, rf_real)
sfs_selected_real = SFSX_def(X_scaled_real, y_real, X_real, rf_real)

summary_real_df = pd.DataFrame({
    'Feature': X_real.columns,
    'RF_Importance': rf_importances_real,
    'Selected_RFE': rfe_selected_real,
    'Selected_RFECV': rfecv_selected_real,
    'Selected_SFS': sfs_selected_real
}).sort_values(by='RF_Importance', ascending=False)

summary_real_df
Out[160]:
Feature RF_Importance Selected_RFE Selected_RFECV Selected_SFS
process_resident_memory_bytes_value process_virtual_memory_bytes_value 0.115705 True True False
process_cpu_seconds_total_value process_open_fds_value 0.070902 True True False
fivegs_smffunction_sm_pdusessioncreationreq_value fivegs_smffunction_sm_pdusessioncreationsucc_v... 0.055278 True True False
fivegs_smffunction_sm_n4sessionreportsucc_value fivegs_smffunction_sm_pdusessioncreationreq_value 0.054883 True True True
fivegs_smffunction_sm_pdusessioncreationsucc_value fivegs_smffunction_sm_qos_flow_nbr_value 0.052874 False True False
fivegs_pcffunction_pa_policysmassosucc_value fivegs_pcffunction_pa_sessionnbr_value 0.050845 True True False
fivegs_pcffunction_pa_policysmassoreq_value fivegs_pcffunction_pa_policysmassosucc_value 0.050003 False True False
fivegs_amffunction_mm_paging5gsucc_value fivegs_amffunction_rm_regemergreq_value 0.046187 False True False
fivegs_smffunction_sm_n4sessionreport_value fivegs_smffunction_sm_n4sessionreportsucc_value 0.044077 True True True
fivegs_smffunction_sm_qos_flow_nbr_value fivegs_smffunction_sm_sessionnbr_value 0.043774 True True False
fivegs_upffunction_sm_n4sessionreportsucc_value fivegs_upffunction_upf_qosflows_value 0.042791 False True False
fivegs_amffunction_mm_confupdatesucc_value fivegs_amffunction_mm_paging5greq_value 0.038687 False True False
fivegs_upffunction_sm_n4sessionestabreq_value fivegs_upffunction_sm_n4sessionreport_value 0.034406 True True False
fivegs_upffunction_upf_sessionnbr_value gn_rx_createpdpcontextreq_value 0.028330 True True False
fivegs_upffunction_upf_qosflows_value fivegs_upffunction_upf_sessionnbr_value 0.026020 False True False
fivegs_upffunction_sm_n4sessionreport_value fivegs_upffunction_sm_n4sessionreportsucc_value 0.023817 True True False
fivegs_amffunction_mm_confupdate_value fivegs_amffunction_mm_confupdatesucc_value 0.022418 False False False
fivegs_pcffunction_pa_sessionnbr_value fivegs_smffunction_sm_n4sessionestabreq_value 0.021964 False True False
fivegs_amffunction_rm_reginitsucc_value fivegs_amffunction_rm_registeredsubnbr_value 0.020152 False False False
fivegs_pcffunction_pa_policyamassosucc_value fivegs_pcffunction_pa_policysmassoreq_value 0.018935 False False False
amf_session_value amf_session_value 0.018332 False False True
bearers_active_value fivegs_amffunction_amf_authreject_value 0.017998 False True True
fivegs_amffunction_mm_paging5greq_value fivegs_amffunction_mm_paging5gsucc_value 0.017492 False True False
fivegs_smffunction_sm_sessionnbr_value fivegs_upffunction_sm_n4sessionestabreq_value 0.016850 False False False
fivegs_amffunction_rm_reginitreq_value fivegs_amffunction_rm_reginitsucc_value 0.015055 False False False
fivegs_pcffunction_pa_policyamassoreq_value fivegs_pcffunction_pa_policyamassosucc_value 0.011432 False False False
ran_ue_value s5c_rx_parse_failed_value 0.009158 False False False
fivegs_amffunction_rm_registeredsubnbr_value fivegs_amffunction_rm_regmobreq_value 0.004531 False False False
fivegs_amffunction_amf_authreq_value fivegs_amffunction_mm_confupdate_value 0.004071 False False True
fivegs_amffunction_rm_regmobreq_value fivegs_amffunction_rm_regmobsucc_value 0.003710 False False False
fivegs_amffunction_rm_regmobsucc_value fivegs_amffunction_rm_regperiodreq_value 0.002589 False False False
gnb_value gtp1_pdpctxs_active_value 0.002511 False False False
log_type process_max_fds_value 0.002398 False False False
application bearers_active_value 0.002091 False False True
fivegs_amffunction_rm_regperiodreq_value fivegs_amffunction_rm_regperiodsucc_value 0.001830 False False False
gn_rx_parse_failed_value gnb_value 0.001824 False False False
gn_rx_createpdpcontextreq_value gn_rx_deletepdpcontextreq_value 0.001713 False False True
fivegs_smffunction_sm_n4sessionestabreq_value fivegs_smffunction_sm_n4sessionreport_value 0.001483 False False False
fivegs_amffunction_rm_regperiodsucc_value fivegs_ep_n3_gtp_indatapktn3upf_value 0.001000 False False False
gtp1_pdpctxs_active_value gtp2_sessions_active_value 0.000827 False False False
gn_rx_deletepdpcontextreq_value gn_rx_parse_failed_value 0.000644 False False False
process_open_fds_value process_start_time_seconds_value 0.000220 False False True
process_start_time_seconds_value process_virtual_memory_max_bytes_value 0.000148 False False False
process_virtual_memory_bytes_value ran_ue_value 0.000048 False False False
fivegs_ep_n3_gtp_outdatapktn3upf_value fivegs_pcffunction_pa_policyamassoreq_value 0.000000 False False False
gtp2_sessions_active_value gtp_new_node_failed_value 0.000000 False False True
gtp_new_node_failed_value gtp_peers_active_value 0.000000 False False False
gtp_peers_active_value process_cpu_seconds_total_value 0.000000 False False False
fivegs_amffunction_rm_regemergreq_value fivegs_amffunction_rm_regemergsucc_value 0.000000 False False False
fivegs_amffunction_amf_authreject_value fivegs_amffunction_amf_authreq_value 0.000000 False False True
process_max_fds_value process_resident_memory_bytes_value 0.000000 False False False
fivegs_amffunction_rm_regemergsucc_value fivegs_amffunction_rm_reginitreq_value 0.000000 False False False
process_virtual_memory_max_bytes_value s5c_rx_createsession_value 0.000000 False False False
fivegs_ep_n3_gtp_indatapktn3upf_value fivegs_ep_n3_gtp_outdatapktn3upf_value 0.000000 False False False
s5c_rx_createsession_value application 0.000000 False False False
s5c_rx_parse_failed_value log_type 0.000000 False False False
In [161]:
# Load the RF importances for synthetic and real data
rf_synth = summary_df[['Feature', 'RF_Importance']]
rf_real = summary_real_df[['Feature', 'RF_Importance']] 

# Rename columns for synthetic and real data
rf_synth = rf_synth.rename(columns={'RF_Importance': 'Importance_Synthetic'})
rf_real = rf_real.rename(columns={'RF_Importance': 'Importance_Real'})

# Merge the two DataFrames on 'Feature'
merged = pd.merge(rf_synth, rf_real, on='Feature', how='inner')

# Select the top N features based on the maximum importance from both datasets
top_n = 25
merged['Combined'] = merged[['Importance_Synthetic', 'Importance_Real']].max(axis=1)
top_features = merged.sort_values(by='Combined', ascending=False).head(top_n)

# Graphical representation of the feature importances
plt.figure(figsize=(12, 6))
bar_width = 0.4
indices = range(len(top_features))

plt.barh([i + bar_width for i in indices], top_features['Importance_Synthetic'], height=bar_width, label='Synthetic', color='skyblue')
plt.barh(indices, top_features['Importance_Real'], height=bar_width, label='Real', color='salmon')

plt.yticks([i + bar_width/2 for i in indices], top_features['Feature'])
plt.xlabel('Random Forest Feature Importance')
plt.title('Porovnanie dôležitosti príznakov (Synthetic vs Real)')
plt.legend()
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
No description has been provided for this image
In [162]:
comparison_df = pd.DataFrame({
    'Feature': [top_features['Feature'].iloc[i] for i in range(len(top_features))],
    'RF_Synthetic': summary_df['RF_Importance'].head(25).tolist(),
    'RF_Real': summary_real_df['RF_Importance'].head(25).tolist(),
})

# Thresholds for feature selection
real_thresh = 0.03
synthetic_thresh = 0.01

# Select features based on the thresholds
selected_features = comparison_df[
    (comparison_df['RF_Real'] >= real_thresh) &
    (comparison_df['RF_Synthetic'] >= synthetic_thresh)
]['Feature'].tolist()

selected_features
Out[162]:
['process_open_fds_value',
 's5c_rx_parse_failed_value',
 'process_virtual_memory_bytes_value',
 'fivegs_amffunction_rm_reginitsucc_value',
 'fivegs_smffunction_sm_pdusessioncreationsucc_value',
 'fivegs_smffunction_sm_pdusessioncreationreq_value',
 'fivegs_smffunction_sm_qos_flow_nbr_value',
 'fivegs_pcffunction_pa_sessionnbr_value',
 'fivegs_pcffunction_pa_policysmassosucc_value',
 'fivegs_upffunction_sm_n4sessionreport_value',
 'fivegs_amffunction_rm_registeredsubnbr_value',
 'fivegs_amffunction_rm_regemergreq_value',
 'fivegs_amffunction_mm_confupdate_value']
In [163]:
# Check if selected features will pass the RFE, RFECV, and SFS test (at least 2/3)
after_test_selected = selected_features.copy()
after_test_selected = [
    feature for feature in selected_features 
    if sum([
        summary_df.set_index('Feature').at[feature, 'Selected_RFE'],
        summary_df.set_index('Feature').at[feature, 'Selected_RFECV'],
        summary_df.set_index('Feature').at[feature, 'Selected_SFS']
    ]) >= 1
]
selected_features = after_test_selected
selected_features
Out[163]:
['process_open_fds_value',
 's5c_rx_parse_failed_value',
 'process_virtual_memory_bytes_value',
 'fivegs_amffunction_rm_reginitsucc_value',
 'fivegs_smffunction_sm_pdusessioncreationsucc_value',
 'fivegs_smffunction_sm_qos_flow_nbr_value',
 'fivegs_pcffunction_pa_sessionnbr_value',
 'fivegs_pcffunction_pa_policysmassosucc_value',
 'fivegs_upffunction_sm_n4sessionreport_value',
 'fivegs_amffunction_rm_registeredsubnbr_value',
 'fivegs_amffunction_rm_regemergreq_value',
 'fivegs_amffunction_mm_confupdate_value']
Vybrané metriky: Metriky, ktoré spĺňajú podmienky pre výber metrík z reálnych dát a syntetických dát.
Vybrané metriky: Výber metrík bol vykonaný pomocou skóre dôležitosti Random Forest na syntetických a reálnych datasetoch 5G siete. Metriky boli ponechané iba v prípade, že mali dôležitosť ≥ 0.03 v reálnych dátach (čo naznačuje relevantnosť v reálnom svete) a ≥ 0.01 v syntetických dátach (zabezpečenie aspoň minimálnej generalizácie počas tréningu). Tento dvojitý prahový prístup zmierňuje posun domény, pričom uprednostňuje signály z reálneho sveta a zároveň zachováva kompatibilitu so syntetickým tréningovým prostredím.
In [165]:
def permut_imp(data, selected_features, label, visualize=False):

    """Calculate Permutation Importance for given features."""

    try:
        X = data[selected_features]
        y = data['current_uc']
        X_scaled = StandardScaler().fit_transform(X)

        model = RandomForestClassifier(n_estimators=100, random_state=32)
        model.fit(X_scaled, y)

        result = permutation_importance(model, X_scaled, y, n_repeats=5, random_state=32)

        importance_df = pd.DataFrame({
            'Feature': selected_features,
            f'Permutation_Importance_{label}': result.importances_mean
        }).sort_values(by=f'Permutation_Importance_{label}', ascending=False)

        # Visualize the results
        if visualize:
            plt.figure(figsize=(10, 6))
            plt.barh(importance_df['Feature'], importance_df[f'Permutation_Importance_{label}'], color='skyblue')
            plt.xlabel("Permutation Importance")
            plt.title(f"Permutation Importance ({label})")
            plt.gca().invert_yaxis()
            plt.grid(True)
            plt.tight_layout()
            plt.show()

        return importance_df

    except Exception as e:
        print(f"❌ Error in permut_imp for {label}: {e}")
        return pd.DataFrame()
In [166]:
permut_imp(data, selected_features, "Synthetic", True)
No description has been provided for this image
Out[166]:
Feature Permutation_Importance_Synthetic
11 fivegs_amffunction_mm_confupdate_value 0.109630
7 fivegs_pcffunction_pa_policysmassosucc_value 0.097833
2 process_virtual_memory_bytes_value 0.086315
3 fivegs_amffunction_rm_reginitsucc_value 0.069795
4 fivegs_smffunction_sm_pdusessioncreationsucc_v... 0.056893
5 fivegs_smffunction_sm_qos_flow_nbr_value 0.040753
0 process_open_fds_value 0.034742
6 fivegs_pcffunction_pa_sessionnbr_value 0.024210
9 fivegs_amffunction_rm_registeredsubnbr_value 0.009258
1 s5c_rx_parse_failed_value 0.000000
8 fivegs_upffunction_sm_n4sessionreport_value 0.000000
10 fivegs_amffunction_rm_regemergreq_value 0.000000
In [167]:
permut_imp(real_data, selected_features, "Real", True)
No description has been provided for this image
Out[167]:
Feature Permutation_Importance_Real
6 fivegs_pcffunction_pa_sessionnbr_value 0.131496
8 fivegs_upffunction_sm_n4sessionreport_value 0.046986
11 fivegs_amffunction_mm_confupdate_value 0.044952
7 fivegs_pcffunction_pa_policysmassosucc_value 0.030685
5 fivegs_smffunction_sm_qos_flow_nbr_value 0.008342
4 fivegs_smffunction_sm_pdusessioncreationsucc_v... 0.007959
0 process_open_fds_value 0.000000
1 s5c_rx_parse_failed_value 0.000000
2 process_virtual_memory_bytes_value 0.000000
3 fivegs_amffunction_rm_reginitsucc_value 0.000000
9 fivegs_amffunction_rm_registeredsubnbr_value 0.000000
10 fivegs_amffunction_rm_regemergreq_value 0.000000
Permutation Importance (Real Data): Permutačná importance odhaľuje reálne silné znaky, ktoré môžu byť v syntetických dátach slabo zastúpené alebo chýbať.
- fivegs_upffunction_sm_n4sessionreport_value má najvyššiu permutačnú dôležitosť, hoci v syntetických dátach nemal žiaden význam – signalizuje možný bias syntetického datasetu.
- process_virtual_memory_bytes_value, ktorý bol vysoko v syntetike, má nulový prínos v realite.
- Prvky ako fivegs_pcffunction_pa_sessionnbr_value alebo fivegs_pcffunction_pa_policysmassosucc_value sa objavili v oboch dátach – potvrdzujú svoju robustnosť naprieč doménami.
In [ ]:
synthetic_perm = permut_imp(data, selected_features, "Synthetic")
real_perm = permut_imp(real_data, selected_features, "Real")

# Check if the permutation importance tables are empty
if synthetic_perm.empty or real_perm.empty:
    raise ValueError("❗ One of the permutation importance tables is empty. Check selected_features or data.")

merged = pd.merge(synthetic_perm, real_perm, on='Feature', how='inner')
merged = merged.merge(summary_df[['Feature', 'RF_Importance']], on='Feature', how='left')
merged = merged.rename(columns={'RF_Importance': 'RF_Importance_Synthetic'})
merged = merged.merge(summary_real_df[['Feature', 'RF_Importance']], on='Feature', how='left')
merged = merged.rename(columns={'RF_Importance': 'RF_Importance_Real'})

final_features = merged[
    (merged['Permutation_Importance_Real'] >= 0.001)
].sort_values(by='Permutation_Importance_Real', ascending=False)

final_features = final_features['Feature'].tolist()

# Add the log_type and application columns
new_features = ['log_type', 'application']
final_features = final_features + new_features

output = {"features": final_features}

with open('selected_features.json', 'w') as f:
    json.dump(output, f)

output
Out[ ]:
{'features': ['fivegs_pcffunction_pa_sessionnbr_value',
  'fivegs_upffunction_sm_n4sessionreport_value',
  'fivegs_amffunction_mm_confupdate_value',
  'fivegs_pcffunction_pa_policysmassosucc_value',
  'fivegs_smffunction_sm_qos_flow_nbr_value',
  'fivegs_smffunction_sm_pdusessioncreationsucc_value',
  'log_type',
  'application']}
In [ ]:
synthetic_perm = permut_imp(data, selected_features, "Synthetic")
real_perm = permut_imp(real_data, selected_features, "Real")

# Check if the permutation importance tables are empty
if synthetic_perm.empty or real_perm.empty:
    raise ValueError("❗ One of the permutation importance tables is empty. Check selected_features or data.")

merged = pd.merge(synthetic_perm, real_perm, on='Feature', how='inner')
merged = merged.merge(summary_df[['Feature', 'RF_Importance']], on='Feature', how='left')
merged = merged.rename(columns={'RF_Importance': 'RF_Importance_Synthetic'})
merged = merged.merge(summary_real_df[['Feature', 'RF_Importance']], on='Feature', how='left')
merged = merged.rename(columns={'RF_Importance': 'RF_Importance_Real'})

final_features = merged[
    (merged['Permutation_Importance_Real'] >= 0.001)
].sort_values(by='Permutation_Importance_Real', ascending=False)

final_features = final_features['Feature'].tolist()

# Add the log_type and application columns
new_features = ['log_type', 'application']
final_features = final_features + new_features

output = {"features": final_features}

output
Out[ ]:
{'features': ['fivegs_pcffunction_pa_sessionnbr_value',
  'fivegs_upffunction_sm_n4sessionreport_value',
  'fivegs_amffunction_mm_confupdate_value',
  'fivegs_pcffunction_pa_policysmassosucc_value',
  'fivegs_smffunction_sm_qos_flow_nbr_value',
  'fivegs_smffunction_sm_pdusessioncreationsucc_value',
  'log_type',
  'application']}
In [171]:
def permut_imp_stability(data, selected_features, label, n_runs=10):

    """Calculate the stability of Permutation Importance across multiple runs."""

    X = data[selected_features]
    y = data['current_uc']
    X_scaled = StandardScaler().fit_transform(X)

    importances = []

    for i in range(n_runs):
        model = RandomForestClassifier(n_estimators=100, random_state=i)
        model.fit(X_scaled, y)
        result = permutation_importance(model, X_scaled, y, n_repeats=5, random_state=i)
        importances.append(result.importances_mean)

    importances = np.array(importances)
    mean_importance = np.mean(importances, axis=0)
    std_importance = np.std(importances, axis=0)
    median_importance = np.median(importances, axis=0)

    stability_df = pd.DataFrame({
        'Feature': selected_features,
        f'PI_Mean_{label}': mean_importance,
        f'PI_Std_{label}': std_importance,
        f'PI_Median_{label}': median_importance
    }).sort_values(by=f'PI_Median_{label}', ascending=False)

    # Visualize the results
    plt.figure(figsize=(10, 6))
    plt.barh(stability_df['Feature'], stability_df[f'PI_Median_{label}'],
             xerr=stability_df[f'PI_Std_{label}'], color='skyblue')
    plt.xlabel("Median Permutation Importance (± std)")
    plt.title(f"Permutation Importance Stability ({label}) across {n_runs} runs")
    plt.gca().invert_yaxis()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    return stability_df
In [ ]:
stability_synthetic = permut_imp_stability(data, selected_features, "Synthetic")
stability_real = permut_imp_stability(real_data, selected_features, "Real")
No description has been provided for this image
No description has been provided for this image
In [173]:
print(stability_synthetic)
print(stability_real)
                                              Feature  PI_Mean_Synthetic  \
3             fivegs_amffunction_rm_reginitsucc_value           0.089895   
11             fivegs_amffunction_mm_confupdate_value           0.080016   
7        fivegs_pcffunction_pa_policysmassosucc_value           0.076676   
2                  process_virtual_memory_bytes_value           0.071801   
5            fivegs_smffunction_sm_qos_flow_nbr_value           0.050805   
4   fivegs_smffunction_sm_pdusessioncreationsucc_v...           0.049039   
0                              process_open_fds_value           0.037172   
6              fivegs_pcffunction_pa_sessionnbr_value           0.020860   
9        fivegs_amffunction_rm_registeredsubnbr_value           0.009667   
1                           s5c_rx_parse_failed_value           0.000000   
8         fivegs_upffunction_sm_n4sessionreport_value           0.000000   
10            fivegs_amffunction_rm_regemergreq_value           0.000000   

    PI_Std_Synthetic  PI_Median_Synthetic  
3           0.015386             0.085924  
11          0.009668             0.080594  
7           0.010849             0.075319  
2           0.009972             0.072008  
5           0.007041             0.051682  
4           0.005301             0.048636  
0           0.005167             0.039325  
6           0.001584             0.020749  
9           0.000708             0.009570  
1           0.000000             0.000000  
8           0.000000             0.000000  
10          0.000000             0.000000  
                                              Feature  PI_Mean_Real  \
6              fivegs_pcffunction_pa_sessionnbr_value      0.126382   
8         fivegs_upffunction_sm_n4sessionreport_value      0.050429   
7        fivegs_pcffunction_pa_policysmassosucc_value      0.025359   
11             fivegs_amffunction_mm_confupdate_value      0.015584   
4   fivegs_smffunction_sm_pdusessioncreationsucc_v...      0.010111   
5            fivegs_smffunction_sm_qos_flow_nbr_value      0.008808   
3             fivegs_amffunction_rm_reginitsucc_value      0.001665   
0                              process_open_fds_value      0.000000   
1                           s5c_rx_parse_failed_value      0.000000   
2                  process_virtual_memory_bytes_value      0.000000   
9        fivegs_amffunction_rm_registeredsubnbr_value      0.000000   
10            fivegs_amffunction_rm_regemergreq_value      0.000000   

    PI_Std_Real  PI_Median_Real  
6      0.022317        0.131629  
8      0.005488        0.048917  
7      0.017205        0.022682  
11     0.015349        0.010877  
4      0.002631        0.010169  
5      0.003259        0.010155  
3      0.002013        0.000103  
0      0.000000        0.000000  
1      0.000000        0.000000  
2      0.000000        0.000000  
9      0.000000        0.000000  
10     0.000000        0.000000  
In [175]:
# Check if the permutation importance tables are empty
if stability_synthetic.empty or stability_real.empty:
    raise ValueError("❗ One of the permutation importance tables is empty. Check selected_features or data.")

merged_stability = pd.merge(stability_synthetic, stability_real, on='Feature', how='inner')
merged_stability = merged_stability.merge(summary_df[['Feature', 'RF_Importance']], on='Feature', how='left')
merged_stability = merged_stability.rename(columns={'RF_Importance': 'RF_Importance_Synthetic'})
merged_stability = merged_stability.merge(summary_real_df[['Feature', 'RF_Importance']], on='Feature', how='left')
merged_stability = merged_stability.rename(columns={'RF_Importance': 'RF_Importance_Real'})
merged_stability = merged_stability.rename(columns={
    'PI_Mean_Synthetic': 'PI_Mean_Synthetic',
    'PI_Std_Synthetic': 'PI_Std_Synthetic',
    'PI_Median_Synthetic': 'PI_Median_Synthetic',
    'PI_Mean_Real': 'PI_Mean_Real',
    'PI_Std_Real': 'PI_Std_Real',
    'PI_Median_Real': 'PI_Median_Real'
})

final_features_stability = merged_stability[
    (merged_stability['PI_Median_Real'] >= 0.001)
].sort_values(by='PI_Median_Real', ascending=False)
final_features_stability = final_features_stability['Feature'].tolist()

# Add the log_type and application columns
new_features = ['log_type', 'application']

final_features_stability = final_features_stability + new_features

output_stability = {"features": final_features_stability}
with open('./json/selected_features.json', 'w') as f:
    json.dump(output, f)

output_stability
Out[175]:
{'features': ['fivegs_pcffunction_pa_sessionnbr_value',
  'fivegs_upffunction_sm_n4sessionreport_value',
  'fivegs_pcffunction_pa_policysmassosucc_value',
  'fivegs_amffunction_mm_confupdate_value',
  'fivegs_smffunction_sm_pdusessioncreationsucc_value',
  'fivegs_smffunction_sm_qos_flow_nbr_value',
  'log_type',
  'application']}
Finálny výber znakov (cross-domain validovaný): Na základe kombinácie Random Forest a Permutation Importance v syntetických aj reálnych dátach sme zvolili znaky, ktoré:
✔ Sú informatívne v syntetike (trénovateľné)
✔ Majú reálnu dôležitosť (reálne použiteľné)

Finálny výber (8 znakov):
fivegs_pcffunction_pa_sessionnbr_value
fivegs_upffunction_sm_n4sessionreport_value
fivegs_pcffunction_pa_policysmassosucc_value
fivegs_amffunction_mm_confupdate_value
fivegs_smffunction_sm_pdusessioncreationsucc_value
fivegs_smffunction_sm_qos_flow_nbr_value
log_type
application
Týmto výberom minimalizujeme doménový bias a maximalizujeme robustnosť pri generalizácii z trénovania na syntetike do reálneho prostredia.
Finálny výber znakov (cross-domain validovaný): Finálny výber znakov je založený na kombinácii permutačnej dôležitosti v syntetických a reálnych dátach. Cieľom je zabezpečiť, že výber reflektuje reálne správanie siete, ale zároveň je trénovateľný na syntetických dátach. Do finálnej množiny sme zaradili iba tie znaky, ktoré vykazovali konzistentný informačný prínos naprieč oboma doménami. Týmto prístupom sme eliminovali znaky, ktoré sú síce dominantné v syntetickom prostredí, ale nereprezentujú realitu (napr. process_virtual_memory_bytes_value), čím sa znižuje riziko doménového biasu.

Referencie ¶

  1. NGUYEN, Giang. Introduction to Data Science. 1. vyd. Bratislava: Slovak University of Technology in Bratislava, 2022. ISBN 978-80-227-5193-3.

  2. Alejopaullier (2024) Make your notebooks look better. https://www.kaggle.com/code/alejopaullier/make-your-notebooks-look-better.

  3. Huang, N., Lu, G. and Xu, D., 2016. A permutation importance-based feature selection method for short-term electricity load forecasting using random forest. Energies, 9(10), p.767. Available at: https://doi.org/10.3390/en9100767